Statistical inference and data mining: false discoveries control

Authors

  • Stéphane Lallich
  • Olivier Teytaud
  • Elie Prudhomme
Abstract

Data mining is characterised by its ability to process large amounts of data. Among these data are the "features", that is, variables or the association rules that can be derived from them. Selecting the most interesting features is a classical data mining problem. That selection requires a large number of tests, from which a number of false discoveries arise. This paper proposes an original nonparametric control method. A new criterion, UAFWER, defined as the risk of exceeding a pre-set number of false discoveries, is controlled by BS_FD, a bootstrap-based algorithm that can be used on one- or two-sided problems. The usefulness of the procedure is illustrated by the selection of differentially interesting association rules on genetic data.
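
The criterion described in the abstract can be written compactly. In the sketch below, the symbols are notation assumed here for illustration: $V$ is the number of false discoveries among the tests performed, $V_0$ is the pre-set number of false discoveries that is tolerated, and $\alpha$ is the accepted risk level. The abstract itself gives only the verbal definition.

```latex
% Assumed notation: V = number of false discoveries, V_0 = pre-set tolerated number,
% \alpha = accepted risk level (none of these symbols appear in the abstract).
\mathrm{UAFWER} = P(V > V_0),
\qquad \text{controlled at level } \alpha \text{ when } P(V > V_0) \le \alpha .
```

Under this notation, setting $V_0 = 0$ gives $P(V > 0) = P(V \ge 1)$, the probability of at least one false discovery, which is the usual family-wise error rate.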

Similar resources

The Effect of False Correction Strategy and Inference Strategy on the Paramedical Students’ Reading Comprehension and Attitude

There is a bulk of studies supporting the positive effect of strategy instruction on reading comprehension. This study examined the effect of two reading strategies (i.e., false correction and inference strategy) on English reading comprehension of Iranian paramedical students, using a pretest, posttest, control group design. It also surveyed their attitudes toward the effect and usefulness of ...

A Tutorial on Statistically Sound Pattern Discovery

Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries — patterns that are found in the sample data but do not hold in the wider popul...

Controlling false discoveries in genome scans for selection.

Population differentiation (PD) and ecological association (EA) tests have recently emerged as prominent statistical methods to investigate signatures of local adaptation using population genomic data. Based on statistical models, these genomewide testing procedures have attracted considerable attention as tools to identify loci potentially targeted by natural selection. An important issue with...

Association Rule Interestingness: Measure and Statistical Validation

The search for interesting Boolean association rules is an important topic in knowledge discovery in databases. The set of admissible rules for the selected support and confidence thresholds can easily be extracted by algorithms based on support and confidence, such as Apriori. However, they may produce a large number of rules, many of them are uninteresting. One has to resolve a two-tier problem...

Statistical Detection of EEG Synchrony Using Empirical Bayesian Inference

There is growing interest in understanding how the brain utilizes synchronized oscillatory activity to integrate information across functionally connected regions. Computing phase-locking values (PLV) between EEG signals is a popular method for quantifying such synchronizations and elucidating their role in cognitive tasks. However, high-dimensionality in PLV data incurs a serious multiple test...

Publication date: 2006